1 Introduction

1.1 Dataset description

The dataset used for this task is available on Kaggle (“Telco Customer Churn,” n.d.) and contains information from \(7043\) clients of a telco company, distributed across \(21\) different variables including gender, MonthlyCharges, and PhoneService, among others.

In the business sector, understanding the number of customers gained and lost is crucial for maximizing profitability. Thus, analyzing whether a client is considering switching to another company can help to evaluate and create new company strategies. This dataset is designed for predicting customer churn, storing the final decision of each client.

1.2 Variables description

The information contained in the dataset can be grouped into different areas:

Personal information.

  • customerID; A unique identifier for the client.

  • gender; The customer’s gender, Female or Male.

  • SeniorCitizen; Binary variable indicating if the customer is older than 65 (Yes) or not (No).

  • Partner; Binary variable indicating whether the customer has a partner (Yes) or not (No).

  • Dependents; Binary variable indicating whether the client lives with dependents (Yes) or not (No).

Services subscribed.

  • tenure; The number of months the customer has been with the company by the end of the third quarter (3Q).

  • PhoneService; Whether the customer subscribes to home phone service (Yes or No).

  • MultipleLines; Whether the customer has more than one telephone line (Yes or No). This service is only available if the customer has PhoneService active.

  • InternetService; Type of subscription to Internet service (DSL, Fiber Optic, or No).

The following extra services are only available to users who have InternetService active (the company does not charge any additional fee).

  • OnlineSecurity; Whether the customer subscribes to an online security service (Yes or No).

  • OnlineBackup; Whether the customer subscribes to an online backup service (Yes or No).

  • DeviceProtection; Whether the customer subscribes to a device protection plan (Yes or No).

  • TechSupport; Whether the customer subscribes to a technical support plan from the company with reduced wait times (Yes or No).

  • StreamingTV; Whether the customer uses the Internet service to stream television programming (Yes or No).

  • StreamingMovies; Whether the customer uses the Internet service to stream movies (Yes or No).

Account information

  • Contract; The type of contract that the customer has signed (Month-to-Month, One Year, Two Year).

  • PaperlessBilling; Whether the customer has chosen paperless billing (Yes or No).

  • PaymentMethod; Customer payment method (Electronic Check, Mailed Check, Bank Transfer (automatic), or Credit Card (automatic)).

  • MonthlyCharges; Customer monthly charges for all services from the company.

  • TotalCharges; Total charges accumulated up to the end of the third quarter (3Q).

Target variable

  • Churn; Whether the customer left the company this quarter (Yes) or stayed with the company (No).

1.3 Goal

The objective of this assignment is to predict the Churn variable, which represents whether a client will remain with or leave the company’s services within the coming period. This prediction task is a binary classification problem where the goal is to forecast one of two possible outcomes: churn yes (leave the company) or churn no (stay with the company).

This valuable information can provide insights that allow the company to form better strategies and enhance customer retention by optimizing its service offering, as well as increase its profitability.

2 Data preprocessing

This section focuses on preparing the dataset for exploratory data analysis (EDA) and applying the computational models. This includes loading the necessary material, handling missing values, encoding categorical variables, and splitting the dataset into training and testing subsets, among other tasks.

2.1 Load material.

Load libraries. Several libraries are needed for the project to work properly.

library(skimr) # Summary statistics
library(tidyverse) # Include ggplot2, dplyr, among others
library(mice) # Imputation missing data
library(VIM) # Imputation graph
library(caret) # Models
library(GGally) # Plots of the continuous variables
library(pander) # For making tables
library(gridExtra) # For grid.arrange
library(reshape2) # Data reshaping (melt)
library(corrplot) # Correlation matrix
library(patchwork) # Combine ggplot2 plots
library(knitr) # Dynamic report generation
library(psych) # Compute the correlation matrix
library(polycor) # Compute the correlation heatmap
library(pROC) # ROC curve analysis
library(plotly) # Interactive plot

Load dataset. Import the dataset.

data <- read.csv("Telco-Customer-Churn.csv", header = TRUE, sep = ",")

First look at the dataset. Brief overview of the dataset for further preprocessing steps.

glimpse(data)
## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

2.2 Delete unnecessary information.

Some variables are removed because they contribute little to the model’s predictions. For instance, customerID is just the unique identifier for each customer. PaperlessBilling, on the other hand, carries little meaning nowadays: in the current digital era, most customers opt for online billing anyway. Thus these columns are excluded from the analysis.

# Remove unnecessary variables
data <- data %>% dplyr::select(-c(customerID, PaperlessBilling))

2.3 Data cleaning.

By analyzing the dataset, we can see that certain variables are closely related; therefore some modifications are applied.

Combine PhoneService and MultipleLines. The first variable indicates whether the client has subscribed to home phone service. Only when the client has this service can they choose between having \(1\) phone line (MultipleLines=No) or more than \(1\) (MultipleLines=Yes). This relationship leads to three possible scenarios:

  1. No phone service. Customers who have not subscribed to phone service (PhoneService=No and MultipleLines=“No phone service”).

  2. Single phone line. Customers with phone service and a single line (PhoneService=Yes and MultipleLines=No).

  3. Multiple phone lines. Customers with phone service and multiple lines (PhoneService=Yes and MultipleLines=Yes).

To simplify this, a new variable (PhoneServiceType) is introduced with three categories corresponding to the above scenarios:

  • NoPhone: if there is no phone service.

  • PhoneSingle: if there is phone service with a single line.

  • PhoneMultiple: if there is phone service with multiple lines.

InternetService and its associated variables. A similar situation occurs with InternetService and its related services (OnlineSecurity, StreamingTV, among others). Clients without internet service cannot sign up for these additional services, which are denoted as “Yes”, “No”, or “No internet service” in the dataset. We can simplify this by treating “No internet service” as equivalent to “No”, thereby converting these variables into binary ones.

Note. For both situations, it has been computationally verified that the relationships between these variables are consistent, ruling out any nonsensical combinations.

## Unique values of each variable ----------------------------------------------
cat_vars <- sapply(data, is.character) # Select qualitative variables
lapply(data[, cat_vars], unique) # Print each unique values


## PhoneService and MultipleLines ----------------------------------------------
# Subset with clients where PhoneService is "No"
subset_no_Phone <- data[data$PhoneService == "No", ]
# Check MultipleLines is set to "No phone service"
if (all(subset_no_Phone$MultipleLines == "No phone service")) {
  cat("In the subset where PhoneService is 'No', 
      MultipleLines is set to 'No phone service' only.\n")
} else {
  cat("In the subset where PhoneService is 'No', 
      MultipleLines is not exclusively set to 'No phone service'.\n")
}

# Combine PhoneService and MultipleLines on a new variable (PhoneServiceType)
# Convert into a factor
data <- data %>%
  mutate(PhoneServiceType = case_when(
    MultipleLines == "No phone service" ~ "NoPhone",
    MultipleLines == "No" ~ "PhoneSingle",
    MultipleLines == "Yes" ~ "PhoneMultiple"
  )) %>%
  mutate(PhoneServiceType = factor(PhoneServiceType, levels = c("NoPhone", "PhoneSingle", "PhoneMultiple")))

# Remove original variables
data <- dplyr::select(data, -PhoneService, -MultipleLines)


## Consequences Internet service -----------------------------------------------
# Subset with clients where InternetService is "No"
subset_no_Internet <- data[data$InternetService == "No", ]

vars_to_check <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection",
                   "TechSupport", "StreamingTV", "StreamingMovies")

# Check its associate variables are set to "No internet service"
for (variable in vars_to_check) {
  unique_values <- unique(subset_no_Internet[[variable]])
  cat("Unique values for", variable, ":", unique_values)
  cat("\n")
}

# Reduce to a binary variable ("No internet service" -> "No")
for (variable in vars_to_check) {
  data[[variable]] <- ifelse(data[[variable]] == "No internet service", "No", data[[variable]])
}

2.4 Handling missing data.

In the dataset, a small number of observations of the variable TotalCharges contain missing values, only \(0.156\)% of the customers. To handle these missing values, we have several options:

  • Delete these rows; Given the low share of missing values, deleting these observations will not significantly affect the dataset.

  • Set NA’s to \(0\); Since these observations correspond to new clients, we can assume that not enough time has passed to compute this value, so setting it to zero can be a good approach.

  • Imputation; We can impute the missing data using techniques provided by the R library mice (van Buuren and Groothuis-Oudshoorn 2011). By default, mice generates \(5\) imputations; for this project, we will only consider the first one. We will use the ‘pmm’ (predictive mean matching) method, which replaces each missing value with an observed value whose predicted mean is closest to that of the missing entry. After generating the imputed data, the complete() function fills in the missing values with those from the first imputation.
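The first two options can be sketched as follows (for reference only, since imputation is the approach taken below; both operate on the `data` object loaded earlier):

```r
# Option 1: delete the rows with missing TotalCharges
data_deleted <- data[!is.na(data$TotalCharges), ]

# Option 2: set the missing TotalCharges of these new clients to 0
data_zeroed <- data
data_zeroed$TotalCharges[is.na(data_zeroed$TotalCharges)] <- 0
```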

To check that there is no missing data anymore, we use a graph from library(VIM) (Kowarik and Templ 2016), which displays the percentage of NA’s.

Also, to assess whether the imputed data adequately reflect the distributions of the original data, we can use the densityplot() function. It shows the marginal distribution of the observed data in blue and, in red, the \(5\) densities for the predictor with initially missing data.

## MICE Imputation -------------------------------------------------------------
meth <- rep("", ncol(data)) # Set all methods to "" (no imputation)
names(meth) <- names(data) # Name the vector so columns can be addressed
meth["TotalCharges"] <- "pmm" # Impute TotalCharges only, with pmm

mice_model <- mice(data,
  m = 5, maxit = 5, method = meth,
  seed = 1234, print = FALSE
)
data <- complete(mice_model, 1) # Complete the data with the first imputation
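The two diagnostic plots described above can be produced with a sketch like the following (assuming the `mice_model` object from the previous chunk; `aggr` comes from VIM and `densityplot` from mice):

```r
# Missing-value pattern after imputation (VIM); all bars should read 0% NA
aggr(data, numbers = TRUE, sortVars = TRUE)

# Observed (blue) vs. imputed (red) densities for TotalCharges (mice)
densityplot(mice_model, ~ TotalCharges)
```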

As we can observe on the density plot, the imputed distribution does not fit the observed one as closely as it should. Hence, setting these observations to \(0\) or deleting them might be a better option. However, given the low share of missing values in the dataset, we will continue the project with this imputation technique.

2.5 Encoding categorical variables.

Once we have the dataset ready, the goal of this section is to obtain a dataset that our statistical models know how to handle. This includes turning the categorical variables into factors.

Binary variables. There are a total of \(11\) binary variables: gender, SeniorCitizen, Partner, Dependents, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies and Churn. All of these have been transformed into factors with two possible outcomes, “No” or “Yes” (Female/Male in the case of gender).

Multi-state variables. There are \(4\) variables of this kind: InternetService, Contract, PaymentMethod, and PhoneServiceType. As before, I will create factor levels for each possible outcome. Note that this strategy can be followed because there is no “natural order” among the categories.

## Binary encoding -------------------------------------------------------------
binary_variables <- c(
  "Partner", "Dependents", "OnlineSecurity", "OnlineBackup",
  "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Churn"
)
for (var in binary_variables) {
  data[[var]] <- factor(data[[var]], levels = c("No", "Yes"))
}

data$gender <- factor(data$gender)
levels(data$gender) # Check the levels

unique(data$SeniorCitizen) # Already encoded as 1's and 0's
data$SeniorCitizen <- factor(data$SeniorCitizen, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))
levels(data$SeniorCitizen)

## Multi-state variables -------------------------------------------------------
# Rename levels of all variables as needed to avoid misunderstandings later

# InternetService to a factor and rename levels
data$InternetService <- factor(data$InternetService,
                               levels = c("DSL", "Fiber optic", "No"),
                               labels = c("DSL", "FiberOptic", "No"))
levels(data$InternetService) # Check the levels


# Contract to a factor and rename levels
data$Contract <- factor(data$Contract,
                        levels = c("Month-to-month", "One year", "Two year"),
                        labels = c("MonthToMonth", "OneYear", "TwoYear"))
levels(data$Contract) # Check the levels


# PaymentMethod to a factor and rename levels
data$PaymentMethod <- factor(data$PaymentMethod,
                             levels = c("Electronic check", 
                                        "Mailed check", 
                                        "Bank transfer (automatic)",
                                        "Credit card (automatic)"),
                             labels = c("ElectronicCheck",
                                        "MailedCheck", 
                                        "BankTransfer",
                                        "CreditCard")
                             )
levels(data$PaymentMethod) # Check the levels

2.6 Split the dataset

The final step before applying the statistical tools consists of dividing the dataset into two subsets. The training set takes \(80\)% of the data and is used to train the models; this subset allows the model to learn patterns and relationships within the data. The testing set (\(20\)% of the data), on the other hand, consists of unseen data used to evaluate the performance of the trained model.

set.seed(1234) # For reproducibility

# Split into training (80%) and testing (20%) set
index <- createDataPartition(data$Churn, p=0.8, list=FALSE)
training <- data[ index,]
testing <- data[-index,]

nrow(training) # Number of observations for training
nrow(testing) # Number of observations for testing

3 Exploratory Data Analysis (EDA)

The aim is to uncover patterns, detect outliers, and test assumptions through visual and quantitative methods.

3.1 Numeric variables.

The ggpairs function from the GGally package allows us to create a scatter plot matrix for these three variables. This matrix includes histograms for each variable along the diagonal, showcasing their distributions. The lower triangle displays scatter plots to explore the relationships between variable pairs, while the upper triangle shows the correlation coefficients, providing a quantitative measure of their relationships. This grid of views allows us to identify the shape of the distributions, tendencies, and outliers, among other things.

# Select continuous variables (include target)
continuous_data <- training[, c("tenure", "MonthlyCharges", "TotalCharges", "Churn")]

# Create scatter plot matrix
ggpairs(continuous_data,
        aes(fill="pink"),
        lower = list(continuous = "points", combo = "box_no_facet"),
        upper = list(continuous = "cor"),
        diag = list(continuous = "barDiag"),
        title = "Scatter plot matrix"
       )

  • Tenure

The histogram shows a bimodal distribution, with peaks at the extremes of the range. This implies that most customers are either new (low end) or very loyal (high end), with fewer clients in the mid-range.

  • MonthlyCharges

Displays a significant peak at the beginning of the distribution, suggesting a large group of clients paying around \(25\)$ per month. The distribution then becomes more uniform for charges between \(60\)$ and \(120\)$.

  • TotalCharges

It follows a right-skewed distribution, meaning most customers subscribe to services that require smaller payments. It is clear that most clients pay no more than \(1250 \$\) in total. To bring this variable closer to a normal distribution, I will apply a log transformation.

General insights.

This analysis reveals non-normal distributions for the numeric variables; since some of the statistical methods that will be used are sensitive to scale, the data will be standardized. In addition, it is worth mentioning the high correlation between tenure and TotalCharges (correlation coefficient of \(0.823\)). Finally, it is also notable that these two parameters present some outliers.

Relation with the output.

Digging deeper, customers with lower total charges are more likely to leave (Churn=Yes). The second insight concerns tenure: the density plot leads us to the conclusion that clients tend to leave within the first year. This could be because they find better alternatives or due to dissatisfaction with the company.

Data transformation code.

# Numeric transformation
# TRAINING dataset -------------------------------------------------------------
training_transf <- training
# Log-transformation to solve TotalCharges right skewness
training_transf$TotalCharges <- log(training_transf$TotalCharges)
# Standardizing the variables
training_transf[c("tenure", "MonthlyCharges", "TotalCharges")] <- scale(training_transf[c("tenure", "MonthlyCharges", "TotalCharges")])

# TESTING dataset --------------------------------------------------------------
testing_transf <- testing
# Log-transformation to solve TotalCharges right skewness
testing_transf$TotalCharges <- log(testing_transf$TotalCharges)
# Standardizing the variables
testing_transf[c("tenure", "MonthlyCharges", "TotalCharges")] <- scale(testing_transf[c("tenure", "MonthlyCharges", "TotalCharges")])
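Note that the chunk above standardizes the testing set with its own means and standard deviations. A common alternative, sketched here with caret's preProcess (not what the rest of this report uses), is to estimate the centering and scaling parameters on the training subset only and reuse them on the testing subset:

```r
num_vars <- c("tenure", "MonthlyCharges", "TotalCharges")

train_alt <- training
test_alt <- testing
# Log-transform TotalCharges first, as above
train_alt$TotalCharges <- log(train_alt$TotalCharges)
test_alt$TotalCharges <- log(test_alt$TotalCharges)

# Estimate centering/scaling on the training subset only...
pre <- preProcess(train_alt[num_vars], method = c("center", "scale"))
train_alt[num_vars] <- predict(pre, train_alt[num_vars])
# ...and apply the same parameters to the testing subset
test_alt[num_vars] <- predict(pre, test_alt[num_vars])
```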

3.2 Qualitative variables.

Binary variables.

In the table below, we observe that gender as well as Partner are equally represented in the dataset, with both categories close to a \(50\%\) split. Meanwhile, only approximately \(16\)% of the observations belong to senior citizens. This information aligns with demographic expectations.

Variable             No    Yes
SeniorCitizen      4716    920
Partner            2921   2715
Dependents         3927   1709
OnlineSecurity     4022   1614
OnlineBackup       3693   1943
DeviceProtection   3724   1912
TechSupport        4016   1620
StreamingTV        3492   2144
StreamingMovies    3442   2194

gender      n
Female   2787
Male     2849
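The counts above can be reproduced from the factor-encoded training set with a sketch like the following (pander renders the result as a table):

```r
binary_vars <- c("SeniorCitizen", "Partner", "Dependents", "OnlineSecurity",
                 "OnlineBackup", "DeviceProtection", "TechSupport",
                 "StreamingTV", "StreamingMovies")

sapply(training[binary_vars], table) # matrix of No/Yes counts per variable
pander(table(training$gender)) # gender is tabulated separately (Female/Male)
```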

The following plots show the churn distribution across each binary variable. We see that customers who leave do not differ by gender or by whether they have a partner. People who are subscribed to extra services such as online security, online backup, device protection, and tech support, or who have dependents, tend to stay with the company (Churn=No) most of the time.
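These per-variable churn plots can be built along the following lines (a sketch; the variable selection here is illustrative, not the full set used in the report):

```r
training %>%
  dplyr::select(Partner, Dependents, OnlineSecurity, TechSupport, Churn) %>%
  mutate(across(-Churn, as.character)) %>% # unify types before reshaping
  pivot_longer(-Churn, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = Churn)) +
  geom_bar(position = "fill") + # stacked proportions per category
  facet_wrap(~ variable, nrow = 1) +
  ylab("Proportion of customers")
```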

Multi-state variables.

The next step consists of exploring the \(4\) multi-state variables: InternetService, Contract, PaymentMethod, and PhoneServiceType. Some insights from these plots are that a majority of customers subscribe to internet as well as phone services, suggesting these are essential. There is no favorite payment method; they are more or less equally distributed. Lastly, the most popular choice among clients is to renew the contract month by month.

Finally, if we study the relationship of these variables with the target, we see that people with two-year contracts tend to be more loyal clients (less likely to switch to another company), while customers paying by electronic check have a higher probability of abandoning the company.
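For instance, the Contract-Churn relationship just described can be visualized with a minimal sketch:

```r
ggplot(training, aes(x = Contract, fill = Churn)) +
  geom_bar(position = "fill") + # proportion of churners per contract type
  ylab("Proportion of customers") +
  ggtitle("Churn by contract type")
```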

3.3 Correlation matrix

The heatmap represents the strength and direction of the relationships between the variables in the dataset, where the color intensity indicates the magnitude of the correlation. Dark colors, either blue (positive) or red (negative), suggest a stronger relationship between variables. A positive coefficient means that an increase (or decrease) in one variable is associated with an increase (or decrease) in the other; conversely, a negative correlation indicates that as one variable increases, the other decreases. Colors in the middle of the scale are interpreted as weak correlations.

To deal with mixed data types, the hetcor function from the polycor package in R is used. This enables the computation of correlations across numeric, binary, and multi-state variables, as long as the qualitative variables are represented as factors.

# Works for mixed variables: binary, multi-state and numeric
# Qualitative variables have to be in factor

cor_matrix <- hetcor(training_transf)
cor_values <- cor_matrix$correlations
corrplot(cor_values, method="color", type="upper", order="hclust",
         tl.cex=0.7, tl.col="black", tl.srt=45)

Analytical exploration

Coefficients higher than \(0.7\) or lower than \(-0.7\) are considered strong. In this dataset we encounter \(5\) strong relationships, shown in the table below.

Most of these relationships are rather obvious. For instance, the longer a client has been using the company’s services, the higher their accumulated charges. However, one interesting insight emerges: customers who subscribe to one streaming service are likely to subscribe to the other (StreamingMovies-StreamingTV = \(0.7526\)).

cor_flat <- as.data.frame(as.table(cor_values))
cor_flat <- cor_flat[order(-abs(cor_flat$Freq)),]

# Filter correlations: >0.7 or <-0.7
strong_correlations <- subset(cor_flat, abs(Freq) > 0.7 & Var1 != Var2)

# Remove duplicates rows
strong_correlations$pairID <- apply(strong_correlations[, c("Var1", "Var2")], 
                                    1,
                                    function(x) paste(sort(x), collapse = "-"))
strong_correlations <- strong_correlations[!duplicated(strong_correlations$pairID),]
strong_correlations$pairID <- NULL # Remove ID

# Add sign
strong_correlations$Sign <- ifelse(strong_correlations$Freq > 0, "+", "-")

pander(strong_correlations)
Var1              Var2               Freq   Sign
TotalCharges      tenure           0.8338      +
MonthlyCharges    StreamingTV      0.7629      +
MonthlyCharges    StreamingMovies  0.7549      +
StreamingMovies   StreamingTV      0.7526      +
Contract          tenure           0.7143      +

3.4 Target variable

To see the distribution of our target variable (Churn), we first look at the raw counts and then at the percentages to understand the proportion of customers who leave versus those who stay with the company’s services.

# Check counts of Churn
pander(table(data$Churn))
No Yes
5174 1869
# Results in percentage
pander(prop.table(table(data$Churn)) * 100)
No Yes
73.46 26.54

The dataset is slightly imbalanced in the distribution of the Churn variable: there are more observations from clients who decide to stay with the company (Churn=No) than from those who decide to leave. Although an imbalance exists, it is relatively minor.

After applying a sampling technique to address this issue, no improvement was observed in the model’s performance; in fact, the results were worse than with the original data proportions. This suggests that the imbalance may not impact the model’s accuracy. Therefore, I have decided not to apply any balancing methods when computing the statistical tools. However, code for performing upsampling is provided in the Annex section.

4 Statistical tools

The objective of this section is to evaluate the performance of some classification models for the binary classification task. The goal is to predict whether the customer intends to stop using the company’s services or not. For each of these techniques, we will compute the confusion matrix, a tool that indicates how many instances were correctly classified and how many were misclassified.

First, the dataset is divided into predictors and the target variable.

# Divide into predictors and target both subsets
X_train <- training_transf %>% dplyr::select(-Churn)
y_train <- training_transf$Churn

X_test <- testing_transf %>% dplyr::select(-Churn)
y_test <- testing_transf$Churn
n_test <- dim(testing_transf)[1]

To ensure a robust evaluation, a common training control is configured: \(5\) repeats of \(10\)-fold cross-validation.

# Set up train control -> 5 repeats of 10-fold cross validation
train_ctrl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           number = 10)

Finally, it is worth mentioning that for every model we generate a plot of the posterior probability of churn for each observation in the test set. Each point represents a prediction of the model under evaluation: blue points are customers who decide to stay (Churn=No) and red points those who will probably abandon the company (Churn=Yes). The dashed horizontal line at \(0.5\) serves as the decision threshold, indicating the probability level at which the model switches from predicting that a customer will not churn to predicting that they will. This gives a visual representation of how the model is predicting.

4.1 Bayes classifiers

4.1.1 QDA

QDA stands for Quadratic Discriminant Analysis. This model assumes a Gaussian distribution of the predictors for each class, but allows each class to have its own covariance matrix.

# Train the QDA model using caret
model_QDA <- train(Churn ~ .,    
                    data = training_transf,
                    method = "qda",
                    trControl = train_ctrl) # Use the defined train control

# Predictions
predictions_QDA <- predict(model_QDA, X_test)
# head(predictions_QDA)

# Posterior probabilities
post_prob_QDA <- predict(model_QDA, X_test, type = "prob")
# head(post_prob_QDA)

# Performance measures
confMat_QDA <- table(predictions_QDA, testing_transf$Churn)
addmargins(confMat_QDA)
##                
## predictions_QDA   No  Yes  Sum
##             No   797   86  883
##             Yes  237  287  524
##             Sum 1034  373 1407
## [1] "------ Testing measures ------"
## [1] "Error for QDA: 0.229566"
## [1] "Accuracy for QDA: 0.770434"
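The printed testing measures follow directly from the confusion matrix. A minimal sketch, with the counts hard-coded from the QDA output above (in the report itself, `confMat_QDA` is used directly):

```r
# QDA confusion matrix: rows = predictions, columns = actual classes
confMat <- matrix(c(797, 237, 86, 287), nrow = 2,
                  dimnames = list(prediction = c("No", "Yes"),
                                  actual = c("No", "Yes")))

accuracy <- sum(diag(confMat)) / sum(confMat) # share correctly classified
error <- 1 - accuracy
round(c(error = error, accuracy = accuracy), 6) # 0.229566, 0.770434
```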
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
  geom_point(aes(y = post_prob_QDA[,"Yes"], color = Churn)) +
  theme_light(base_size = 14) +
  scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
  xlab("Individual") +
  ylab("Probability of Churn with QDA") +
  geom_hline(yintercept = 0.5, linetype = "dashed")

The confusion matrix and the plot represent the same idea in different ways. In the plot, red points below and blue points above the threshold line are misclassifications; in the confusion matrix, these clients appear in the off-diagonal cells.

We can conclude that the model:

  • It is good at identifying the customers who are predicted to stay and actually stay with the company (\(90.26\)% of its “No” predictions are right).

  • The worst mistake is when the model predicts that a client will stay but they actually leave; with QDA this happens \(86\) times.

  • The other possible error is predicting that a client will leave when they actually stay with the company (\(237\) wrong identifications).

  • Finally, the model is only right \(54.77\)% of the time when predicting that a client will leave.

  • Furthermore, the kappa value is not too high.

4.1.2 LDA

Linear Discriminant Analysis (LDA) is a simplified version of the QDA model that introduces some bias to reduce the variance.

# Train the LDA model using caret
model_LDA <- train(Churn ~ .,    
                    data = training_transf,
                    method = "lda",
                    trControl = train_ctrl) # Use the defined train control

# Predictions
predictions_LDA <- predict(model_LDA, X_test)
# head(predictions_LDA)

# Posterior probabilities
post_prob_LDA <- predict(model_LDA, X_test, type = "prob")
# head(post_prob_LDA)

# Performance measures
confMat_LDA <- table(predictions_LDA, testing_transf$Churn)
addmargins(confMat_LDA)
##                
## predictions_LDA   No  Yes  Sum
##             No   934  163 1097
##             Yes  100  210  310
##             Sum 1034  373 1407
## [1] "------ Testing measures ------"
## [1] "Error for LDA: 0.186923"
## [1] "Accuracy for LDA: 0.813077"
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
  geom_point(aes(y = post_prob_LDA[,"Yes"], color = Churn)) +
  theme_light(base_size = 14) +
  scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
  xlab("Individual") +
  ylab("Probability of Churn with LDA") +
  geom_hline(yintercept = 0.5, linetype = "dashed")

This model achieved an accuracy of \(81.3077\)%, improving on the QDA performance, so we can conclude it makes good predictions about whether a client will churn.

However, compared to QDA, the errors that are most costly for the company (predicting a client will stay when they actually leave) increased from \(86\) to \(163\), while the opposite mistake (predicting churn for a client who actually stays) decreased from \(237\) to \(100\). As before, the model accurately identifies people who will stay with the company.

4.1.3 Naïve Bayes

Naïve Bayes is useful when the dataset is very large and with multi-class predictors. It assumes the predictors are independent of one another, an assumption that clearly does not hold here, so for this dataset it might not work as well as QDA or LDA.

# Train Naive Bayes model
model_NB <- train(x = X_train, 
                  y = y_train, 
                  method = "naive_bayes", 
                  trControl = train_ctrl,
                  tuneGrid = data.frame(laplace = 0.5, # tuneGrid (not tuneLength) takes a parameter grid
                                        usekernel = FALSE,
                                        adjust = 1))


# Predict on test data
predictions_NB <- predict(model_NB, newdata = X_test)
# head(predictions_NB)

# Posterior probabilities
post_prob_NB <- predict(model_NB, X_test, type = "prob")
# head(post_prob_NB)

# Evaluate the model using confusion matrix
confusionMatrix(predictions_NB, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  846 108
##        Yes 188 265
##                                           
##                Accuracy : 0.7896          
##                  95% CI : (0.7674, 0.8107)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 1.127e-06       
##                                           
##                   Kappa : 0.4947          
##                                           
##  Mcnemar's Test P-Value : 4.395e-06       
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.7105          
##          Pos Pred Value : 0.8868          
##          Neg Pred Value : 0.5850          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6013          
##    Detection Prevalence : 0.6780          
##       Balanced Accuracy : 0.7643          
##                                           
##        'Positive' Class : No              
## 
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
  geom_point(aes(y = post_prob_NB[,"Yes"], color = Churn)) +
  theme_light(base_size = 14) +
  scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
  xlab("Individual") +
  ylab("Probability of Churn with Naïve Bayes") +
  geom_hline(yintercept = 0.5, linetype = "dashed")

In summary, the model achieved an accuracy of \(78.96\)%, which, while not the highest among the models tested, is still fairly good. It demonstrates a strong ability to correctly identify the clients that will remain with the company (Positive Predictive Value). However, regarding the wrong predictions, note that it makes \(108\) of the most costly mistakes (a figure that lies between those of the previous models).

4.2 Logistic regression

Logistic regression is a statistical method for predicting binary outcomes that also allows categorical predictors, so it seems well suited to this classification task. However, note that it is sensitive to outliers in the data.
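For reference, the model estimates the churn probability through the logistic function of a linear combination of the predictors,

\[ P(\text{Churn} = \text{Yes} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{T} x)}}, \]

and an observation is classified as Yes when this probability exceeds the chosen threshold (\(0.5\) by default).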

# Train Logistic Regression classifier
model_LR <- train(Churn ~.,
                  data = training_transf,
                  method = "glm",
                  family = "binomial", # For binary logistic regression
                  trControl = train_ctrl)

# Predict on test data
predictions_LR <- predict(model_LR, X_test)
# head(predictions_LR)

# Posterior probabilities
post_prob_LR <- predict(model_LR, X_test, type = "prob")
# head(post_prob_LR)

# Evaluate the model using confusion matrix
confusionMatrix(predictions_LR, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  944 168
##        Yes  90 205
##                                           
##                Accuracy : 0.8166          
##                  95% CI : (0.7954, 0.8365)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 3.431e-13       
##                                           
##                   Kappa : 0.4957          
##                                           
##  Mcnemar's Test P-Value : 1.636e-06       
##                                           
##             Sensitivity : 0.9130          
##             Specificity : 0.5496          
##          Pos Pred Value : 0.8489          
##          Neg Pred Value : 0.6949          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6709          
##    Detection Prevalence : 0.7903          
##       Balanced Accuracy : 0.7313          
##                                           
##        'Positive' Class : No              
## 
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
  geom_point(aes(y = post_prob_LR[,"Yes"], color = Churn)) +
  theme_light(base_size = 14) +
  scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
  xlab("Individual") +
  ylab("Probability of Churn with Logistic regression") +
  geom_hline(yintercept = 0.5, linetype = "dashed")

To conclude, the model’s accuracy is \(81.66\)%, so it predicts both retention and churn reasonably well. The sensitivity is high, showing the model is very good at identifying customers who will stay with the company’s services. However, this model tends to misclassify the clients who will leave (low specificity). Finally, we can observe that the most costly errors amount to \(168\) misclassified customers.

4.3 Incorporating economic impact

To incorporate the economic impact caused by a customer staying with or leaving the Telco company, I will compute the profit for the most accurate model: the logistic regression classifier. Notice it presents the highest accuracy among the models tested (see the accuracy summary at the end of the report).

Recall the coding of Churn: Yes = the customer left the company, and No = the customer remained with the company.

When the model makes a wrong prediction, we must take into account that not all errors carry the same financial consequences. For instance, if the model predicts that a customer will stay when they actually leave, this turns into a loss of the expected revenue from that customer. On the other hand, mistakenly predicting that a client will leave when they actually stay does not have a huge influence on the company, although it might lead to a lost opportunity.

Thus, it is important to read the confusion matrix of the chosen model (the logistic regression classifier) correctly. The worst error in the table below is the one committed \(168\) times. The goal is then to decrease this error by changing the probability threshold; the optimal threshold can be selected using specific economic assumptions.

##               
## predictions_LR  No Yes
##            No  944 168
##            Yes  90 205

Economic assumptions used (matching the values in the code below).

  • If the company predicts a customer is going to stay and they actually stay, the company gains a 20% profit at the end of the quarter.

  • If the company predicts a customer is going to stay but they leave, the loss is 100%.

  • If the company predicts a customer is going to leave but they actually stay, the company takes a small loss of 1%.

  • If the company predicts a customer is going to leave and they actually leave, the profit is 3%.

Using the specified profits and losses for each prediction outcome, we compute the average profit per customer based on the model’s predictions.
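In formula form, with \(p_{TN}, p_{FP}, p_{FN}, p_{TP}\) denoting the profit (or loss) assigned to each cell of the confusion matrix:

\[ \text{profit per customer} = \frac{TN \cdot p_{TN} + FP \cdot p_{FP} + FN \cdot p_{FN} + TP \cdot p_{TP}}{TN + FP + FN + TP}. \]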

# Confusion matrix values
TN = 944
FP = 168
FN = 90
TP = 205

# profit/loss per outcome, as fractions of the customer's revenue
profit_TN = 0.20
loss_FP = -1.0
profit_FN = -0.01
profit_TP = 0.03

# Compute profit per client
profit_per_client = (TN * profit_TN + FP * loss_FP + FN * profit_FN + TP * profit_TP) / (TN + FP + FN + TP)
sprintf("Profit per customer: %f", profit_per_client)
## [1] "Profit per customer: 0.018515"

The goal is to improve this economic output by reducing the most costly mistakes (customers that are predicted to stay but actually switch to another company).

profit_unit <- c(0.20, -0.01, -1.0, 0.03)

## Selecting the optimal threshold
profit_i = matrix(NA, nrow = 50, ncol = 10)
# 50 training/testing replicates for each of the 10 threshold values
 
p0 <- 0.73
p1 <- 1-p0

j <- 0

# Hyper-parameter tuning: loop over candidate thresholds
for (threshold in seq(0.05,0.5,0.05)){
  j <- j + 1
  cat(j)
  for(i in 1:50){
    
    # partition data into training (40%) and testing (60%) sets
    d <- createDataPartition(training_transf$Churn, p = 0.4, list = FALSE)
    # select training sample
    
    train <- training_transf[d,]
    test <- training_transf[-d,]  
    
    # Fit logistic regression model on the training sample
    full_model <- glm(Churn ~ ., data = train, family = "binomial")
    
    # Obtain predicted probabilities on test set
    probabilities <- predict(full_model, test, type = "response")

    # Convert probabilities to predicted classes based on threshold
    Churn_pred <- ifelse(probabilities > threshold, "Yes", "No")
    
    # Create confusion matrix
    CM <- table(Churn_pred, test$Churn)

    # Calculate profit per applicant
    profit.applicant <- sum(profit_unit*CM)/sum(CM)
    profit_i[i,j] <- profit.applicant
    
  }
}
## 12345678910
boxplot(profit_i, main = "Hyper-parameter selection",
        ylab = "unit profit",
        xlab = "threshold", names = seq(0.05,0.5,0.05),col="royalblue2")

The height of each box in the boxplot represents the variability of the unit profit across the replicates of the hyper-parameter tuning; a smaller spread means a more reliable performance. In this situation, the threshold that maximizes the unit profit for the Telco company is \(0.05\).
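The same choice can be made programmatically by averaging each column of profit_i and taking the threshold with the highest mean. A minimal self-contained sketch, using a made-up toy matrix in place of the real profit_i:

```r
thresholds <- seq(0.05, 0.5, 0.05)

# Toy stand-in for profit_i: 50 replicate profits per threshold
# (a made-up decreasing pattern, NOT the real simulation results)
toy_profit <- outer(rep(1, 50), seq(0.06, 0.015, length.out = 10))

# Average unit profit for each candidate threshold
avg_profit <- colMeans(toy_profit)

# Threshold with the highest average unit profit
best_threshold <- thresholds[which.max(avg_profit)]
best_threshold  # 0.05 for this toy matrix
```

With the real profit_i matrix, the last two lines reproduce the choice read off the boxplot.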

Finally, once we have computed the optimal threshold, we can use it to train our model and make predictions on the test set. This yields a model with worse accuracy, but the economic profit increases.

## Final prediction for testing set using the optimal hyper-parameter:
best_threshold <- 0.05
# Fit the logistic regression model using the training data
final_model <- train(Churn ~ .,
                     data = training_transf, 
                     method = "glm", 
                     family = "binomial", 
                     trControl = train_ctrl)

# Predict probabilities on the test set
probabilities_post <- predict(final_model, newdata = testing_transf, type = "prob")

# Make final predictions using the best threshold
final_predictions <- ifelse(probabilities_post[, "Yes"] > best_threshold, 
                            "Yes", "No")

# Confusion matrix
confusionMatrix(factor(final_predictions), testing_transf$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  376   6
##        Yes 658 367
##                                           
##                Accuracy : 0.5281          
##                  95% CI : (0.5016, 0.5544)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.223           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.3636          
##             Specificity : 0.9839          
##          Pos Pred Value : 0.9843          
##          Neg Pred Value : 0.3580          
##              Prevalence : 0.7349          
##          Detection Rate : 0.2672          
##    Detection Prevalence : 0.2715          
##       Balanced Accuracy : 0.6738          
##                                           
##        'Positive' Class : No              
## 
CM <- confusionMatrix(factor(final_predictions), testing_transf$Churn)$table


profit.applicant <- sum(profit_unit*CM)/sum(CM)
sprintf("Profit per customer (with optimal threshold): %f", profit.applicant)
## [1] "Profit per customer (with optimal threshold): 0.052331"

As initially expected, the company’s profit increases even though the model is worse in terms of accuracy. Given the importance of obtaining benefits, this model, despite making more mistakes overall, is much more beneficial for the business: we have been able to decrease the worst errors to just \(6\).

5 Conclusion

This project focuses on analyzing customer churn for a telecommunications company; the dataset contains \(21\) variables such as gender, total charges, and subscription services, among others. The first step consisted of cleaning and manipulating the data, which included deleting variables with low influence and applying some transformations to approach a normal distribution.

The statistical models assessed included Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), and Logistic Regression (LR). Among all models, the logistic regression classifier proved to be the most accurate technique, reaching an accuracy of \(81.66\)%. This good performance can be explained by the model’s capacity to handle binary outcomes and categorical predictors, making it a robust tool for this dataset.

The last part of the project evaluates the economic impact of our model based on its prediction errors, demonstrating that not all errors carry the same consequences from the company’s point of view. For instance, the “worst” mistake is predicting that a client will keep the company’s services when they actually cancel them. By incorporating the economic impact into the model evaluation, it became clear that a slightly less accurate model could yield higher profits by minimizing those “worst” and most costly mistakes.

In conclusion, the project demonstrates the importance of selecting the right model not only based on statistical accuracy but also by understanding the economic impact. The right decision leads us to a less accurate model that maximizes our profit, and it can be achieved easily by adjusting the churn prediction threshold to balance these two objectives.

############ Summary ACCURACY ############
## QDA -----------------------------------------------
# Accuracy: 0.770434 and Worst error: 86
## LDA -----------------------------------------------
# Accuracy: 0.813077 and Worst error: 163
## NB ------------------------------------------------
# Accuracy: 0.7896 and Worst error: 108
## LR ------------------------------------------------
# Accuracy: 0.8166 and Worst error: 168
## Incorporating economic impact -----------------------
# Accuracy: 0.5281 and Worst error: 6

6 References

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. https://CRAN.R-project.org/package=gridExtra.
Daróczi, Gergely, and Roman Tsegelskyi. 2022. Pander: An r ’Pandoc’ Writer. https://CRAN.R-project.org/package=pander.
Fox, John. 2022. Polycor: Polychoric and Polyserial Correlations. https://CRAN.R-project.org/package=polycor.
Kowarik, Alexander, and Matthias Templ. 2016. “Imputation with the R Package VIM.” Journal of Statistical Software 74 (7): 1–16. https://doi.org/10.18637/jss.v074.i07.
Kuhn, and Max. 2008. “Building Predictive Models in r Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Pedersen, Thomas Lin. 2023. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. “pROC: An Open-Source Package for r and s+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2021. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
“Telco Customer Churn.” n.d. https://www.kaggle.com/datasets/blastchar/telco-customer-churn.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.
Wei, Taiyun, and Viliam Simko. 2021. R Package ’Corrplot’: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
William Revelle. 2023. Psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. https://CRAN.R-project.org/package=psych.
Xie, Yihui. 2023. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.

7 Annex: fixing unbalanced dataset

This section contains the code related to balancing the dataset. The most promising method is downsampling, which avoids the need to “make up” values by reducing the majority class instead.
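What caret’s sampling = "down" does internally can be sketched in base R: every class is randomly reduced to the size of the smallest one. A toy illustration on hypothetical data (not the Telco dataset):

```r
set.seed(1)

# Hypothetical unbalanced 2-class data: 80 "No" vs 20 "Yes"
toy <- data.frame(x1 = rnorm(100),
                  Churn = factor(c(rep("No", 80), rep("Yes", 20))))

# Downsample: keep only as many rows per class as the minority class has
n_min <- min(table(toy$Churn))
balanced <- do.call(rbind, lapply(split(toy, toy$Churn),
                                  function(d) d[sample(nrow(d), n_min), ]))

table(balanced$Churn)  # 20 "No" and 20 "Yes"
```

caret performs this resampling inside each cross-validation fold, which avoids leaking the held-out rows into the balancing step.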

For all these models, we configure a common train control parameter.

# Address the class imbalance of the dataset

# Set up train control -> 5 repeats of 10-fold cross validation
# Resolve class imbalances -> sampling = "down"
train_ctrl_balanced <- trainControl(method = "repeatedcv",
                                    repeats = 5,
                                    number = 10,
                                    sampling = "down")

7.1 QDA

# Train the QDA model using caret
model_QDA_b <- train(Churn ~ .,    
                    data = training_transf,
                    method = "qda",
                    trControl = train_ctrl_balanced) # Use the defined train control
print(model_QDA_b)

# Predictions
predictions_QDA_b <- predict(model_QDA_b, X_test)
head(predictions_QDA_b)

# Posterior probabilities
post_prob_QDA_b <- predict(model_QDA_b, X_test, type = "prob")
head(post_prob_QDA_b)

# Performance measures
confMat_QDA_b <- table(predictions_QDA_b, testing_transf$Churn)
addmargins(confMat_QDA_b)

# Error of the model
error_QDA_b <- (n_test - sum(diag(confMat_QDA_b))) / n_test
error_QDA_b

7.2 LDA

# Ensure the correct train control is used for the balanced LDA model
model_LDA_b <- train(Churn ~ .,    
                    data = training_transf,
                    method = "lda",
                    trControl = train_ctrl_balanced) # Correctly use the balanced train control
print(model_LDA_b)

# Predictions for the balanced LDA model
predictions_LDA_b <- predict(model_LDA_b, X_test)
head(predictions_LDA_b)

# Posterior probabilities for the balanced LDA model
post_prob_LDA_b <- predict(model_LDA_b, X_test, type = "prob")
head(post_prob_LDA_b)

# Performance measures for the balanced LDA model
confMat_LDA_b <- table(predictions_LDA_b, testing_transf$Churn)
addmargins(confMat_LDA_b)

# Calculate the error rate for the balanced LDA model
error_LDA_b <- (n_test - sum(diag(confMat_LDA_b))) / n_test
error_LDA_b

7.3 Naïve Bayes

# Train Naive Bayes model
model_NB_b <- train(x = X_train, 
                  y = y_train, 
                  method = "naive_bayes", 
                  trControl = train_ctrl_balanced,
                  tuneGrid = data.frame(laplace = 0.5, # tuneGrid (not tuneLength) takes a parameter grid
                                        usekernel = FALSE,
                                        adjust = 1))


# Predict on test data
predictions_NB_b <- predict(model_NB_b, newdata = X_test)
head(predictions_NB_b)

# Posterior probabilities
post_prob_NB_b <- predict(model_NB_b, X_test, type = "prob")
head(post_prob_NB_b)

# Evaluate the model using confusion matrix
confusionMatrix(predictions_NB_b, y_test)

# Performance measures -> without using caret
confMat_NB_b <- table(predictions_NB_b, testing_transf$Churn)
addmargins(confMat_NB_b)

# Error of the model
error_NB_b <- (n_test - sum(diag(confMat_NB_b))) / n_test
error_NB_b

7.4 Logistic regression

model_LR_b <- train(Churn ~.,
                  data = training_transf,
                  method = "glm",
                  family = "binomial", # For binary logistic regression
                  trControl = train_ctrl_balanced)

# Print the model summary
print(model_LR_b)

# Predict on test data
predictions_LR_b <- predict(model_LR_b, X_test)
head(predictions_LR_b)

# Posterior probabilities
post_prob_LR_b <- predict(model_LR_b, X_test, type = "prob")
head(post_prob_LR_b)

# Evaluate the model using confusion matrix
confusionMatrix(predictions_LR_b, y_test)

# Performance measures -> without using caret
confMat_LR_b <- table(predictions_LR_b, testing_transf$Churn)
addmargins(confMat_LR_b)

# Error of the model
error_LR_b <- (n_test - sum(diag(confMat_LR_b))) / n_test
error_LR_b

Comparison of the balanced models

############ Summary ACCURACY ############
## QDA -----------------------------------------------
# Accuracy: 0.7349524
## LDA -----------------------------------------------
# Accuracy: 0.7404545
## NB ------------------------------------------------
# Accuracy: 0.7896
## LR ------------------------------------------------
# Accuracy: 0.7455974